PDF Processing

Quick start

Use pdfplumber to extract text from PDFs:

    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        text = pdf.pages[0].extract_text()
        print(text)

Extracting tables

Extract tables from PDFs with automatic detection:

    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:
        page = pdf.pages[0]
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

Extracting all pages

Process multi-page documents page by page:

    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        full_text = ""
        for page in pdf.pages:
            # extract_text() may return None or "" for pages without a text layer
            full_text += (page.extract_text() or "") + "\n\n"
        print(full_text)

Form filling

For PDF form filling, see FORMS.md for the complete guide, including field analysis and validation.

Merging PDFs

Combine multiple PDF files:

    from pypdf import PdfWriter

    # PdfWriter.append replaces the PdfMerger class, which recent pypdf releases deprecate
    writer = PdfWriter()
    for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
        writer.append(pdf)
    writer.write("merged.pdf")
    writer.close()

Splitting PDFs

Extract specific pages or ranges:

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("input.pdf")
    writer = PdfWriter()

    # Extract pages 2-5 (reader.pages is zero-indexed, so indexes 1 through 4)
    for page_num in range(1, 5):
        writer.add_page(reader.pages[page_num])

    with open("output.pdf", "wb") as output:
        writer.write(output)

Available packages

- pdfplumber - Text and table extraction (recommended)
- pypdf - PDF manipulation, merging, splitting
- pdf2image - Convert PDFs to images (requires poppler)
- pytesseract - OCR for scanned PDFs (requires tesseract)

Common patterns

Extract and save text:

    import pdfplumber

    with pdfplumber.open("input.pdf") as pdf:
        # Fall back to "" so pages without extractable text do not break the join
        text = "\n\n".join((page.extract_text() or "") for page in pdf.pages)

    with open("output.txt", "w", encoding="utf-8") as f:
        f.write(text)

Extract tables to CSV:

    import csv

    import pdfplumber

    with pdfplumber.open("tables.pdf") as pdf:
        tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for table in tables:
            # All tables from the page are written back to back into one file
            writer.writerows(table)

Error handling

Handle common PDF issues:

    import pdfplumber

    try:
        with pdfplumber.open("document.pdf") as pdf:
            if len(pdf.pages) == 0:
                print("PDF has no pages")
            else:
                text = pdf.pages[0].extract_text()
                if text is None or text.strip() == "":
                    print("Page contains no extractable text (might be scanned)")
                else:
                    print(text)
    except Exception as e:
        print(f"Error processing PDF: {e}")

Performance tips

- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only the pages you need rather than the entire document
- Close PDF objects after use (the with statement does this automatically)
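The batching tip above can be sketched roughly as follows. This is a minimal illustration, not a prescribed approach: `page_batches` and `extract_text_in_batches` are hypothetical helper names, and the default batch size of 50 is an arbitrary assumption. Only `pdfplumber.open`, `pdf.pages`, and `extract_text` come from the examples in this guide.

```python
from typing import Iterator, List


def page_batches(total_pages: int, batch_size: int) -> Iterator[range]:
    """Yield ranges of zero-based page indexes, batch_size pages at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


def extract_text_in_batches(path: str, batch_size: int = 50) -> Iterator[str]:
    """Yield the combined text of each batch of pages (hypothetical helper)."""
    # Imported lazily so page_batches stays usable without pdfplumber installed.
    import pdfplumber

    # The with statement closes the PDF even if extraction fails midway.
    with pdfplumber.open(path) as pdf:
        for batch in page_batches(len(pdf.pages), batch_size):
            parts: List[str] = []
            for index in batch:
                # Pages without a text layer may give None or "".
                parts.append(pdf.pages[index].extract_text() or "")
            yield "\n\n".join(parts)
```

Because `extract_text_in_batches` is a generator, a caller can write each batch to disk as it arrives (for example, `for chunk in extract_text_in_batches("document.pdf", batch_size=25): f.write(chunk)`) instead of holding the full document text in memory. To extract only needed pages, the same index-based access works with a hand-picked list of indexes in place of the generated ranges.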


